Lzss Compression, or: How I Won Hugi Compo #2
This article describes my program for the second hugi size coding competition.
Task
The task was very simple. You had to write a COM program that displays a predefined text on the screen. The length of the text is 903 bytes, and it must be printed exactly byte per byte.
First of all, a compressor should be developed and then the small decompression routine (in real life you should do this simultaneously, because the com size depends on both code and compressed data).
Compression
I used the well-known lzss algorithm (perhaps lz77 is the correct name). The idea is to replace the repetitions with an offset/size pair, determining the position and size of the previous occurrence. The offset can refer to the position from the beginning of the file or from the current location backwards. To distinguish the normal characters from these offset/size pairs, we must use an extra bit (prefix).
First attempt:
- we must use a bit stream
- the length of the characters is 7 bits
- we use a fixed offset bit length and a fixed size bit length
- The compression algorithm is to search through the text from the beginning and find the longest repetitions in the limits of the offset/size bit lengths. Minimum size limit is 2. (This is a parameter, like offset, and size length in bits. We should try all possibilities for a specific coding.)
With (hopefully) optimal parameters, the patterns look like the following:
7bit character 0 <7 bits>
offset/size repetition 1 <9 bits (offset)> <3 bits (size)>
Simple decoder:
Loop
If GetBits(1) = 0 Then
Text[ Pos ] = GetBits(7)
Pos = Pos + 1
Else
Offset = GetBits(9)
Size = GetBits(3)
For I = 1 to Size
Text[ Pos ] = Text[ Pos - Offset ]
Pos = Pos + 1
No end condition? We can win some bytes in the decoder if we use an end condition like DI >= 8000h (see decoder chapter), but we must append a $ to the input text for the dos print interrupt. Since this little decoder can't freeze, there is no problem when it goes beyond the normal input data.
I don't know the exact size for this compression, but in any case it is not good enough. Where can we optimize it?
We use repetitions only if the size is greater than N. (We will choose the value of N later.) The stored size should be (size-N) to be able to use the full range of x bits. If you make a histogram of the repetitions size parameters, you see that most of them are smaller ones. Another simple coding for the size parameter is to store (size-N) bits with the value 1 and a bit with the value 0 at the end. This way we only need one bit to code a N size repetition.
Second design of the patterns (N=2)
7bit character 0 <7 bits>
offset/size repetition 1 <9 bits (offset)> 1*<(size-2) bits> 0
If you write your own lzss compression routine, you will probably question if this is the optimal compression for this pattern. What if you use another repetition in the beginning and lose some bits, but at the end you will get a smaller result? My first compressor went from the beginning to the end and used the longest repetitions.
Second attempt: Try all possible patterns, not only the longest repetitions, at every character. Bad method: plenty of cases, years to test them all. But I realized that compression depends only on earlier source text, not on the used patterns to compress it. For example, it doesn't influence the further compression if I use repetitions or 7bit characters instead, only the resulting bitstream would probably be longer.
New idea: Let's try to compress from the end to the beginning! We calculate the best compression for every character as it would be the first. Since we know the optimal compressed bitstream for all later positions (remember, we compress back to the beginning), we can calculate the resulting bitstream length for all possible patterns, and select the minimal one.
size[pos] = min( pattern(N).bits + size[pos + pattern(N).codedbytes] )
For our current patterns:
7bit character -> bits = 1+7, codedbytes = 1
offset/size repetition -> bits = 1+9+(size-2)+1, codedbytes = size
(We must try a repetition pattern with every possible size, not only with the longest!)
Those 9bit offsets are very big. What if we make two types of repetition patterns? A smaller and a larger one. If the pattern count is about the same for these two types, we distinguish them with one additional prefix bit best.
Third design of the patterns (N1=2, N2=3)
7bit character 0 <7 bits>
small offset/size rep. 1 0 <6 bits> 1*<(size-2) bits> 0
large offset/size rep. 1 1 <9 bits> 1*<(size-3) bits> 0
We have a small character set, only few upper case letters, four punctuation marks, newline chars, space.
- the first char in the file is upper case
- all other upper case letters come after '?' and '.'
- almost all lines are 44 char long
Let's make a 5bit character set: 'a'..'y' ',' '.' '?' '-'. This is 25+4=29, we still have space for 3 chars. We need one for space, one for $ at the end and one char for special newline, when the line is smaller than 44. Before compression we have to convert the input text: lower case, filter out the newline chars, and $ at the end. Where line is not 44 char long, a special character is needed.
Final design of the patterns
5bit character 0 <5 bits>
small offset/size rep. 1 0 <6 bits> 1*<(size-2) bits> 0
large offset/size rep. 1 1 <9 bits> 1*<(size-3) bits> 0
Compressed data is 395 bytes.
Decoder
We have a good compression algo, now need a small decoding routine. I will explain my final code, unfortunately I have no previous versions.
a. The first problem is the bit stream. We have a very lovely bit test instruction (386+). It can access a -32768..+32767 bit range from memory and return the state in cflag. I use BX for base pointer and BP as bit index.
b. I decided to implement the 5bit -> 7bit conversion in the first pass. The 5bit range has to parts: an alphabetical A..Y range, and the special 7 characters. The first part is simple to decode, but the second needs a lookup table. The smallest byte lookup instruction is xlat -> BX must point to the beginning of the 7byte loopkup table. Since SI=100h at startup, we can easily set BX to 100h, and hope for a non-destructive loopkup table as code at 100h. The first byte is something special because it can be in a range and we can use it for our purpose. The final lookup table:
push si ; newline (negative byte) - 'Z'
ror dl,cl ; , $ (ascii code) - 'Z'
rol si,cl ; - space (ascii code) - 'Z'
mov ah,0C5h ; . ? (ascii code) - 'Z' - 32
Fortunately non-destructive, and we have a 100h on the stack.
pop bx ; bx = 100h
c. Now we have to set up BP for the bit test instruction.
mov bp,(offset data-offset start)*8
d. We also need a destination pointer. The end condition will be a sign check, so it must be less than 8000h-textlength, and it must be larger than 100h+COMlength. We will use STOSB and MOVSB, so DI is the right register (CS=ES=DS), and BP is just fine for initialization.
mov di,bp ; di = free space below 0x8000
e. DI is needed as source for the second pass.
push di
f. We can start the decoding pass. In the first versions I made a GetBits routine as in the example program above. Later, I had a better idea. I read 11 bits at once and after decoding we rewind as needed.
decode: mov cl,11 ; get 11 bits
bitloop:
bt [bx],bp
inc bp
rcr ax,1 ; set msb in ax
loop bitloop
shr ax,6 ; first bit to cflag, others stay in ax
g. Now we have 10 bits in AX, and the first bit in cflag. This is very useful because we can easily decide if we have a repetition pattern or a 5bit character.
jc move ; move token ?
h. 5bit->7bit converting is simple now, as we have the lookup table at [BX]. But, first we must rewind our bit pointer by 5bits because we used only 1+5 bits out of 11.
sub bp,5 ; go back 5bits
and al,31 ; only 5bit used (11-1-5=5)
sub al,25 ; special char?
js abc
xlat ; lookup special char from start[0..6]
abc: add al,90 ; add 'Z'
stosb
jmp ok
i. In case of repetition: the 10 bits in AX are for the offset. The LSB decides the type and whether only 6 bits are used and whether rewind is necessary. I use CX=3 as the minimal repetition size, and decrement in case of small type repetition.
move: mov cl,3 ; mincount = 3
shr ax,1 ; check next bit
jc size9 ; type0 or type1 token?
sub bp,cx ; go back 3bits
and ax,63 ; only 6bit used (11-1-1-3=6)
dec cx ; mincount = 2
size9:
j. We have the offset in AX, now let's copy those bytes. There is one trick: after I copy the minimal size, I reuse the MOVSB part of the REP MOVSB to copy the rest.
mov si,di ; source = (current pos)-(data from ax)
sub si,ax
moving: rep movsb ; first move mincount
bt [bx],bp
inc bp
jc moving+1 ; move one bytes until 0 bit found
k. The end condition of the first pass.
ok: or di,di
jns decode ; loop until di >= 0x8000
; PASS 2
pop si ; si=stage1,di>=0x8000,bx=0x100,dx=0
push di
cwd
j. Here si = pass1 7bit output, di can be used as pass2 output, and BX=100, CX=0, DX=0. The second part is much more compilcated. As you could see at the lookup table, we coded the '.' and '?' as '.'-32 and '?'-32 to distinguish them from the other special characters. Now we can specify 4 input ascii code ranges:
Range1 0..31 for '.' and '?'
Range2 32..63 for other non alphabetical characters
Range3 64..95 for 'A'..'Y'
Range4 128..255 for newline (signed!)
We have two converting modes:
Normal - add 32 to Range1 and Range3 characters (or bitwise OR 32)
Upcase - leave everything unchanged
Mode change conditions:
Normal->Upcase - at startup and after Range1 characters
Upcase->Normal - only after Range3 characters
From the range limits you can see the importance of the 5th bit. By converting we can bitwise OR the mode bit to the 5th bit, if we assume bit-value 1 for Normal mode and bit-value 0 for Upcase mode. For mode changes we use ADD for mode bit and the original 5th bit and use carry as the next mode bit.
However, newline chars must be detected before all this converting and must break the line loop which should have the default size of 44 chars. This bit mode is a little confusing. We can use a normal register like DL to this purpose, and use only the 5th bit. For normal mode DL=20h and for Upcase mode 0h, which is the initial mode.
postadj:mov cl,44 ; assume line will be 44 chars
line: lodsb
or al,dl ; case dl=0 nothing done, upper case
; case dl=32
; case al=A..Z -> a..z
; case al=, $ - space -> nothing done
; case al=.-32 ?-32 -> normal . ?
js newline ; if special negative byte, break loop
stosb ; store adjusted char
add dl,[si-1] ; calc new dl
shr dl,1 ; when al=A..Z -> dl=32
and dl,20h ; when al=, $ - space -> dl=dl
; when al=.-32 ?-32 -> dl=0
loop line
newline:mov ax,0A0Dh ; store newline chars
stosw
k. Wow, we processed one line, wrote the normal newline charaters and everything is ok, but wait, we don't have any end conditions as usual. We can use BX as a line counter, we have plenty of space for those max. 44 long lines, just decoding 256 lines should be enough.
dec bx ; line counter
jne postadj
l. At last, we can display our text.
pop dx ; print text
mov ah,9
int 21h
ret
m. Just insert the compressed data here.
data: ; label for initializing BP
Other Solutions
I could reach the same COM size with a simpler pattern too:
5bit character 0 <5 bits>
offset/size rep. 1 <9 bits> 1*<(size-3) bits> 0
Code 95 bytes + Data 407 bytes = 501 (my final version is 106+396=501)
Another way to reduce the 9bit offset is to use a changing offset bit length. 386+ supports the bsr instruction, which scans for the first bit with the value 1 from msb, so we can easily find the log2 of the current position. The new rep. pattern would be:
1 <log2(pos) bits> 1*<(size-N bits>) 0
Unfortunately, the resulting COM file was about 507 bytes :(